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1 Introduction 

The MCMC (Markov Chain Monte-Carlo) method |TJ has played 
an important role in study of complex systems with many de- 
grees of freedom. For example, MCMC has been applied to var- 
ious many-body problems such as proteins [2j, spin systems [3], 
and lattice gauge theory [3] . Although the method has achieved 
great success, there are systems where Monte-Carlo sampling 
does not work due to local minima of energy functions. For ex- 
ample, in analysis of protein folding, Monte-Carlo random walks 
are trapped in narrow regions of energy space because there are 
so many local minima. For large proteins, it is a hard problem 
to obtain sufficiently large number of statistical samples because 
it takes very long time for a complicated conformation to escape 
from a local minimum in energy space. 

The multicanonical method PJ6|7] may be useful for resolving 
the local minimum problems. It is a kind of generalized-ensemble 
methods. It has been applied to the above mentioned problems 
and worked nicely. Especially its application to protein folding 
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with the multicanonical MD [8,9] has been remarkable. Recently, 
Ikebe et al have succeeded in folding calculation of a 40-residue 
protein Ml OI L Kamiya et al have demonstrated flexible molecular 



docking between a protein and a ligand [Hj . Since protein struc- 



ture prediction is one of the most important subjects in life sci- 
ences, it has been expected that the multicanonical method will 
make a large progress in clarification of the fundamental laws of 
life and development of high-performance drugs. 

The multicanonical method is based on an artificial ensemble 
that gives a flat probability distribution in energy space. The 
advantage of the method is enhancement of rare configurations, 
which results in random walks in wide range of energy space. In 
this method, multicanonical weights are estimated in an itera- 
tive way. To estimate a multicanonical weight, one has to gener- 
ate a Markov chain many times, which must be sufficiently long 
for accurate estimation. The number of arithmetic operations in- 
creases exponentially as the number of amino-acid residues be- 
comes large. Since actual proteins are composed of more than a 
hundred amino-acid residues, the number of necessary operations 
for folding simulation is huge. It would be a possibility to decrease 
the execution time using massively parallel supercomputers. 

In this paper, we propose an algorithm to generate a Markov 
chain in a parallel way. In MCMC calculations, the detailed bal- 
ance is checked between the last and a newly generated con- 
figuration to determine acceptance or rejection of the new one. 
Accepted configurations constitute a so-called Markov chain. A 
Markov chain is a one-dimensional object that is generated seri- 
ally. Since acceptance or rejection of a new configuration depends 
on the last configuration, naive algorithm for Markov-chain gen- 
eration such as the Metropolis method is essentially serial. At 
a glance, it seems that the MCMC algorithm cannot be paral- 
lelized. In this paper, we are going to show a method to joint mul- 
tiple Markov chains together to make a longer Markov chain. The 
constituent Markov chains are independently generated starting 
from different initial configurations. The essential point of the 
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method is that joints between chains are processed so that the 
detailed balance is satisfied. Based on the detailed balance, un- 
necessary configurations are discarded from each Markov chain. 
The remaining Markov chains are connected together to make 
a longer Markov chain. By repeating this procedure, one can 
increase the length of the Markov chain arbitrarily. In this al- 
gorithm, other operations such as evaluation of energy functions 
are not parallelized. 

To demonstrate how the parallelization algorithm works, we solve 
the two-dimensional Ising model by combining the proposed par- 
allelization algorithm and the multicanonical method. Multicanon- 
ical weights are estimated from histograms, which are obtained 
using the parallelization algorithm. 

In Sec. El we introduce an algorithm to generate a long Markov 
chain in a parallel way. In Sec. [3| we review the multicanoni- 
cal method briefly. In Sec. [H we apply the parallelization algo- 
rithm to the two-dimensional Ising model with the multicanonical 
method. In Sec. \5[ we show numerical results. Sec. El is devoted 
to conclusions. 



2 Parallelization of Markov-Chain Generation 

Let us consider a Markov chain that is composed of M configu- 
rations, 

Ci, C2, . . . , Cm- (2.1) 

We call Ck a configuration, which is a set of values of multiple 
variables. Ck are generated so that configurations distribute ac- 
cording to a specified probability distribution P(C). In actual 
calculations, the number of configurations is finite. Therefore, a 
set of configurations only reproduces P(C) approximately. The 
associated errors vanishes in the limit M — > 00. 
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In MCMC, configurations are generated serially so that the de- 
tailed balance is satisfied. One determines acceptance or rejec- 
tion of a randomly generated configuration Ck by checking the 
detailed balance between Ck-i and At a glance, the MCMC 
algorithm seems to be essentially serial and is not suited to paral- 
lel computation. In order to decrease execution time using parallel 
computers, we propose a simple algorithm to parallelize Markov- 
chain generation. 

Let us consider a computer of which parallelism is p. In our par- 
allelization algorithm, each computing node generates a Markov- 
chain separately. In order to obtain a much longer Markov chain 
having M tota i configurations than a Markov chain having M n0( \ e 
configurations generated by each computing node, the p Markov 
chains are connected satisfying the detailed balance condition as 
follows: 



Algorithm 

1. Set the total histogram zero, h(E) = 0, where E is energy 

■ (i) 

variable. Prepare initial configurations C\ randomly on z-th 



node (i = 0, . . . , p — 1 
2. In each node, generate a Markov chain composed of M no( je 
configurations, 



C 1 <i) -C< i) ----C« oa<) i = 0,...,p-l. (2.2) 



3. Set i = 0. 

(n) 

4. In the z-th node, check the detailed balance between C M 



and Ck \ where Cjj^ with n = (i — l+p) mod p is the last con- 
figuration of the previous node. The index k is increased from 
1 to M no dc till the detailed balance is satisfied. When there is 
a configuration that satisfies the detailed balance, that config- 
uration and the succeeding ones are accepted. When there is 
no accepted configuration, set Cj$ node <— C^ nodc . Make a local 
histogram h\ ocaj \(E) using only the accepted configurations. Cal- 
culate the total histogram, h(E) <— h(E) + h\ oca \(E). Send the 
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values of h(E) and the energy of C^j to the {{i + 1) mod p)- 

th node. (When i = 0, Cjj^ ^ 0I " the previous iteration is used 
for checking of the detailed balance.) 

5. Set % <— i + 1. If % < p, go to step 4. 

6. If the total number of the configurations contained in the to- 
tal histogram h(E) is smaller than the specified value M tota i, 
adjust M no de appropriately so that additional calculations are 
minimized, and then go to step 2. 

We are going to give some remarks on the algorithm below. 

The algorithm produces a Markov chain of which the configura- 
tion number is M tQt& \. Actually, the algorithm outputs the total 
histogram h(E). As shown in the next section, the obtained his- 
togram h(E) is used to estimate a multicanonical weight w(E). 
The algorithm is repeated certain times till a multicanonical 
weight that covers a sufficiently large energy area is obtained. 
Since the algorithm only assumes the detailed balance and does 
not depend on the details of probability distribution, the algo- 
rithm works in any MCMC calculations. 

In each node, the last configuration generated in the previous 
iteration is used for generating a Markov chain as an initial con- 
figuration in the next iteration including when the considered 
weight has been updated. This means that p Markov chains are 
generated independently from first to last. Equilibrium depends 
on the current weight. 

There may be no configuration that satisfies the detailed balance 
in step 4. In this case, all of the configurations contained in that 
computing node is discarded, which results in low efficiency of 
Markov-chain generation. One can increase acceptance rate by 
ordering computing nodes according to energy values of generated 
configurations. We will pursue this technique in the next paper. 



5 



3 The Multicanonical Method 



3. 1 Estimation of Multicanonical Weights 

We are going to combine the algorithm introduced in the previous 
section with the multicanonical method. Let us review the mul- 
ticanonical method briefly. For the details of the multicanonical 
method, see 0J6]J7]. 

Consider a statistical system that is defined with a Boltzmann 
weight 

w B (E) = e-? E , (3.1) 

where (3 = l/k^T. Hereafter, we set &b = 1 for simplicity. 
Probability distribution is given by 

P(E) = c /3 D(E)w B (E), (3.2) 

where D(E) is density of states. The constant cp is determined 
with the normalization condition's P{E) = 1. 

In the multicanonical method, an extended weight wm{E) is de- 
fined so that multicanonical distribution is independent of energy, 

P M (E) = c M D(E)w M (E) ~ c M . (3.3) 

We can obtain the multicanonical weight wm in a recursive way. 
We denote the n-th multicanonical weight as w^ n \ where n = 
1,2, ... ,1. In the initial step, we set = 1, which corresponds 
to high-temperature limit of the Boltzmann weight With the 
w^ n \ we generate a sufficiently long Markov chain and obtain 
a total histogram h^ n \E). Then, we generate the next weight 
w (n+i) us [ n g the current weight and the obtained histogram 
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h (n \E). 



w 



(n) 



w 



{ " +1 KE) = \ g otherwise . (3.4) 



We expect that the weight 10" approaches to the correct multi- 
canonical weight wm if this process is repeated certain times. 

To obtain the multicanonical weight accurately, sufficiently large 
statistics are necessary for generating h^ n \E). We can decrease 
execution time consumed to generate sufficiently long Markov 
chains by making use of the parallelization algorithm introduced 
in Sec. m H2J. 



3.2 Evaluation of Statistical Average 



We can calculate statistical average at arbitrary temperature 
with the following reweighting formula: 

<o ( ,)> s eowpw = £ 'g^ ( l4»U ■ (3-5) 



where x represents multiple variables and is a configuration. 
If the operator O(x) can be represented as a function of energy E 
and the multicanonical weight wm{E) is known, one can evaluate 
Eq. ( 13.51 ) without configurations. In this case, the formula ( 13.51 ) 
reduces to 



, 0(x)) _ ME)wu(E)-^ E 



because the density of states is inversely proportional to the 
weight (D(E) oc l/w(E)) and the histogram has flat distribution. 
In Eq. ( 13.61 ). we can evaluate (O(x)) by taking the summation 
for all possible energy values E. The reduced reweighing formula 



7 



(13.61 ) cannot be used for quantities that are dependent on local 
variables. 



4 Application to the 2D Ising Model 

We are going to solve the two-dimensional Ising model at finite 
temperature combining the parallelization algorithm (Sec. [2]) and 
the multicanonical method (Sec. |3j). 

4.1 The 2D Ising Model 

The model is defined on two dimensional square lattices. The 
number of lattice sites is N = L 2 , where L is the lattice size. We 
assume periodic boundary conditions. The Hamiltonian of the 
model is defined as follows: 



where the summation is taken for all possible nearest-neighbour 
sites. All information of thermodynamics is contained in the par- 
tition function 



Based on this, we calculate energy E, specific heat C, free energy 
F, and entropy S in a statistical way. 




Z= £ e 



(4.2) 



Sl,.-,S N 




(4.3) 
(4.4) 



S=(3(E-F). 



(4.5) 
(4.6) 
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4-2 Coding 



We apply a combination of the parallelization algorithm and the 
multicanonical method to the two-dimensional Ising model. We 
estimate multicanonical weights iteratively with the paralleliza- 
tion algorithm. We implement the codes with FORTRAN77. Our 
code set is composed of the following two parts: (i) generation of 
the multicanonical weight and (ii) evaluation of statistical aver- 
age. These are implemented as separate two codes. Since almost 
all of the arithmetic operations are contained in the part (i), only 
the part (i) is parallelized with MPI-1, which is a library spec- 
ification for message-passing proposed as a standard [151 . The 



obtained multicanonical weight is used as an input to the part 
(ii). We calculate statistical average of energy, specific heat, free 
energy, and entropy. Since these quantities can be represented as 
functions of energy, we can use the reweighting formula ( 13.61 ) in 
the part (ii). Execution time of the part (ii) is very short like a 
second on a Xeon 1.50 GHz processor. 

We store logarithm of the weight, not the weight itself, because 
the absolute values of the weight may be very small. For this 
reason, we perform all necessary operations under logarithm to 
evaluate statistical quantities. 

Each node generates a Markov chain composed of M noc j e samples 
using the current multicanonical weight. When all p nodes have 
done Markov-chain generation, the obtained chains are linked 
together using the detailed balance as explained before. To be 
concrete, an energy value of the last configuration and the total 
histogram h(E) are sent to the next node for the detailed-balance 
checking and histogram generation. The Metropolis method is 
used for the detailed-balance checking. If the total number of ac- 
cepted configurations contained in the total histogram h(E) is 
smaller than the specified sample number M to tai, the chain gen- 
eration process is repeated. If it is larger than M to tai, the next 
weight is calculated using Eq. (13.41) and distributed to all the 
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nodes. Then, a new iteration process is started with the new 
weight. The above iteration process for weight estimation is re- 
peated the specified I times. 

5 Numerical Results 

With the implemented codes, we generate multicanonical weights 
using the parallelization algorithm and calculate statistical quan- 
tities using Eq. ( 13.61 ). Table CD is a list of parameter values used 
to plot Fig. Q] 



Table 1 

Parameter values used for calculation of statistical quantities shown in Fig. (TJ 



Parameter 


Value 


Meaning 


L 


100 


Lattice size 


N 


L 2 


The number of lattice sites 


P 


32 


Parallelism 


Mtotal 


10 8 


The total number of samples for one iteration 


-Mnode 


Mtotal /p 


The number of samples generated by a node for one iteration 


/ 


1300 


The number of iterations for weight generation 



In Fig. [U we compare our results with the exact finite-lattice ones 
obtained by Ferdinand and Fisher [TB] when lattice size is L = 
100. As shown in figures (a), (c), and (d), our results for energy, 
free energy, and entropy agree with the exact ones very well. On 
the other hand, in Fig. [Q (b), basically two results agree but we 
admit slight difference. In general, specific heat is more sensitive 
to errors associated with obtained multicanonical weights than 
energy as seen in the definition of specific heat ( 14.41 ). 

When parallelism p is small like 1 < p < 1000, the number of 
samples generated by each node M noc je is sufficiently large with 
a fixed Mtotal- In this case, errors associated with multicanonical 
weights would be small even if the detailed balance is not imposed 
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J, 



Fig. 1. For L = 100, (a) energy E/N, (b) specific heat C, (c) free energy F/N, 
and (d) entropy S/N are plotted as functions of inversed temperature /3. The exact 
results given by Ferdinand and Fisher [16] and ours are plotted with solid lines and 
circles, respectively. 

on the joints of short Markov chains because the number of joints 
is much smaller than the total number of samples. However, in 
near future, supercomputers will acquire parallelism of several 
millions or more. For example, if the same code is executed with 
p = 10 6 and M tota i = 10 8 , we have M no d e = 100, which is quite 
small. In this case, there are relatively many joints compared to 
the total number of samples. 

In order to confirm that the proposed algorithm works well even 
when M noc ie is very small, i.e. each constituent Markov chain is 
very short, we perform a simple experiment. In this experiment, 
we generate very short Markov chains with M n0( \ e = 100, and 
just connect the chains together without imposing the detailed 
balance on the joints. This procedure is repeated till the speci- 
fied number of samples have generated to make a multicanonical 
weight. 

Figure [2] is a comparison of energy among three results with 
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Fig. 2. Comparison of energy among the exact (solid line), multicanonical with the 
detailed balance (circles), and multicanonical without the detailed balance (crosses), 
where we have set L = 20, p = 32, M tota i = 10 8 , and M node = 100. 

L = 20: exact (solid line), multicanonical with the detailed bal- 
ance (circles), and multicanonical without the detailed balance 
(crosses). When the detailed balance is imposed on the joints of 
constituent Markov chains (circles), the energy agrees well with 
the exact one. This result is consistent with our intuition be- 
cause it is in principle guaranteed that sampling based on the 
detailed balance gives the correct distribution. On the other hand, 
when the detailed balance is not imposed on the joints (crosses), 
there is slight deviation of energy from the exact one in the high- 
temperature region. 

According to the reweighting formula ( 13.61 ). the low-energy part of 
the multicanonical weight is dominant when j3 is large. In Fig. El 
the crosses show that the generated weight is sufficiently accurate 
in the low-energy region. This is because the low-energy part of 
the weight is generated in the last phase of a weight-generation 
process and therefore accumulation of errors is not large. On the 
other hand, the middle-energy part of the weight may accumulate 
errors because it is updated every iteration with a very short 
Markov chain that does not cover the entire energy space. This 
is the cause of the energy deviation for small (3 when the detailed 
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balance is not imposed on joints. 

Since even this simple Ising model produces considerable errors 
for small M n0( je, one should be careful when treating more compli- 
cated systems. It depends on the shape of the considered multi- 
canonical weight how energy deviates from the true values. When 
M noc ie is small due to massive parallelism, one can obtain better 
accuracy by imposing the detailed balance on joints between con- 
stituent Markov chains. 

Table 2 

Execution (E), operation (O), and communication (C) time of the code for paral- 
lelism p = 2,4,8,16,32,64 have been measured with L = 100 and M tota i = 10 8 
in the first iteration of weight generation. E is the sum of O and C. Scalability of 
execution time is evaluated in units of the p = 2 case. A PC cluster with Xeon 1.50 
GHz processors has been used for the measurements. 



p 


E (sec) 


O (sec) 


C (sec) 


Scalability 


2 


12.32 


11.77 


0.55 


2.00 


4 


6.21 


5.93 


0.28 


3.97 


8 


3.80 


3.59 


0.21 


6.48 


16 


2.19 


1.66 


0.53 


11.25 


32 


1.40 


0.77 


0.63 


17.60 


64 


0.97 


0.43 


0.54 


25.40 



Finally, in Table [2, we show scalability of the code in the case of 
L = 100 and M to ta! = 10 8 , which show ^-dependence of execu- 
tion (E), operation (O), and communication (C) time. We have 
represented the scalability concerning to execution time in units 
of the p = 2 case. The measurements have been performed in 
the first iteration of weight generation. As p increases, execution 
time decreases well, but there is deviation from the ideal scala- 
bility. This is because the total number of operations is not large 
compared to communications and a small part of the operations 
have not been parallelized. Communication time is comparable 
with operation when p is large. For better performance, serial 
operations and collective communications should be removed as 
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much as possible. We could invent more sophisticated implemen- 
tation of the algorithm for complete scalability. We will consider 
improvement of the code in the next paper. 

6 Conclusions 

We have proposed a parallelization algorithm for Markov-chain 
generation, which can be applied to any MCMC-based meth- 
ods. We have verified the algorithm in the two-dimensional Ising 
model combined with the multicanonical method. We have con- 
firmed accuracy of the obtained multicanonical weights by check- 
ing agreement of energy, specific heat, free energy, and entropy 
with the exact results. We have also shown that multicanoni- 
cal weights may have errors if the detailed balance is not used 
for linking short Markov chains generated in parallel. One can 
decrease such errors if unnecessary configurations are discarded 
with the proposed algorithm when connecting constituent Markov 
chains. The algorithm will be useful for highly massive parallelism 
equipped with future supercomputers. 
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